Preparation and assumed knowledge

  • Review the high-dimensional visualization content in Module 4.
  • Listen to the Week 4 lecture pre-recording.
  • Data files
    • movielens_top40.csv from Canvas
    • author_count.csv from Canvas

Aims

  • Explore decompositions of data using
    • different PCA calibrations
    • different clustering calibrations using \(k\)-means and hierarchical clustering
    • different representations of the data using \(t\)-SNE and MDS
  • Create visualizations using PCA, \(t\)-SNE and basic MDS
  • Understand the difference between clustering algorithms and data visualization.


1 Movie ratings data

We will be analysing the MovieLens dataset, which contains ratings of 58,000 movies by 280,000 users. The entire dataset is too big for us to work with in this lab, so it has been preprocessed and only a small subset is considered. If you want to do more exploration yourself, the entire dataset can be downloaded here.

This part of the lab is based on a chapter in an online book by Rafael Irizarry. You can find it here. There are lots of examples in this book to show you how to use R for data science.

1.1 Data processing [optional]

This part of the code is for interested students only. You do not need this for the lab.

# Here is the code used to preprocess the data (taken from the Irizarry lab): 
library(dplyr)
library(tidyr)
ratings <- read.csv("ml-latest-small/ratings.csv", header = TRUE)
movies <- read.csv("ml-latest-small/movies.csv", header = TRUE)
# Join on the shared key explicitly (this is the only column the two files share)
movielens <- left_join(movies, ratings, by = "movieId")

top <- movielens %>%
  group_by(movieId) %>%
  summarize(n=n(), title = first(title)) %>%
  top_n(40, n) %>%
  pull(movieId)

x <- movielens %>% 
  filter(movieId %in% top) %>%
  group_by(userId) %>%
  filter(n() >= 20) %>%
  ungroup() %>% 
  select(title, userId, rating) %>%
  spread(userId, rating)
x <- as.data.frame(x)
rownames(x) <- x$title
x$title <- NULL
colnames(x) <- paste0("user_", colnames(x))


write.table(x, row.names = TRUE, col.names = TRUE, sep = ",", file = "movielens_top40.csv")

1.2 Data input and IDA

Load the data movielens_top40.csv into R. It contains the 40 movies with the most ratings, and the users who rated at least 20 of those 40 movies. Note that IDA refers to initial data analysis, which is an important component of all data analytics.

movielens <- read.csv("movielens_top40.csv", header = TRUE)
dim(movielens) 
## [1]  40 153
print(movielens[1:5,1:5])
##                           user_1 user_6 user_7 user_15 user_17
## Aladdin (1992)                NA      5    3.0       3      NA
## American Beauty (1999)         5     NA    4.0       4     4.0
## Apollo 13 (1995)              NA      4    4.5      NA     3.5
## Back to the Future (1985)      5     NA    5.0       5     4.5
## Batman (1989)                  4      3    3.0      NA     4.5
head(movielens)
##                           user_1 user_6 user_7 user_15 user_17 user_18 user_19
## Aladdin (1992)                NA      5    3.0       3      NA     3.5       3
## American Beauty (1999)         5     NA    4.0       4     4.0      NA       4
## Apollo 13 (1995)              NA      4    4.5      NA     3.5      NA      NA
## Back to the Future (1985)      5     NA    5.0       5     4.5     4.0       4
## Batman (1989)                  4      3    3.0      NA     4.5      NA       5
## Braveheart (1995)              4      5     NA      NA     4.5     4.5      NA
##                           user_21 user_28 user_39 user_42 user_45 user_57
## Aladdin (1992)                4.0      NA       4      NA     5.0       4
## American Beauty (1999)        2.0     4.0       5      NA     5.0       5
## Apollo 13 (1995)               NA      NA      NA       5     5.0       3
## Back to the Future (1985)     5.0      NA       4       4     3.5       4
## Batman (1989)                 3.5     2.5       4       3      NA       4
## Braveheart (1995)              NA     3.5      NA       4     5.0       4
##                           user_58 user_62 user_63 user_64 user_66 user_68
## Aladdin (1992)                  5      NA     4.0     4.0      NA     3.5
## American Beauty (1999)         NA      NA     5.0     2.5       5     5.0
## Apollo 13 (1995)                4      NA     3.0      NA      NA     3.0
## Back to the Future (1985)      NA     4.5     5.0      NA       3     3.0
## Batman (1989)                   3      NA     4.0      NA       4     4.0
## Braveheart (1995)               5     4.5     2.5     4.0       5     2.5
##                           user_72 user_82 user_84 user_86 user_91 user_96
## Aladdin (1992)                 NA     2.5      NA       4     3.5      NA
## American Beauty (1999)        4.5      NA      NA       4      NA       5
## Apollo 13 (1995)              4.0      NA       5      NA     3.5       5
## Back to the Future (1985)     4.0     4.0       3      NA     3.5      NA
## Batman (1989)                  NA     3.5       3      NA     5.0      NA
## Braveheart (1995)             4.5     4.5      NA      NA     4.0       5
##                           user_103 user_105 user_109 user_112 user_115 user_117
## Aladdin (1992)                  NA       NA        3       NA        4        4
## American Beauty (1999)          NA      5.0       NA       NA        1       NA
## Apollo 13 (1995)               4.0       NA        3      4.0       NA        4
## Back to the Future (1985)       NA       NA       NA      4.0       NA       NA
## Batman (1989)                   NA       NA        4       NA        5        3
## Braveheart (1995)              4.5      3.5        5      3.5        3        5
##                           user_122 user_132 user_135 user_137 user_140 user_141
## Aladdin (1992)                  NA      3.5       NA      4.0        3      4.0
## American Beauty (1999)          NA      4.5        4       NA        4       NA
## Apollo 13 (1995)                NA       NA       NA      3.5        5      3.5
## Back to the Future (1985)      5.0      3.5       NA      3.5        3      2.5
## Batman (1989)                  4.5      2.0        5       NA       NA       NA
## Braveheart (1995)               NA       NA        4      4.0        4      3.5
##                           user_144 user_156 user_160 user_166 user_167 user_177
## Aladdin (1992)                 4.5       NA       NA      5.0      3.0        4
## American Beauty (1999)         4.0      4.5        5      4.0      3.0        4
## Apollo 13 (1995)               3.0      4.0        5       NA      4.0        4
## Back to the Future (1985)       NA      3.5        5       NA       NA        5
## Batman (1989)                  3.5       NA        4      3.5      3.0        3
## Braveheart (1995)              4.5       NA        4       NA      3.5       NA
##                           user_178 user_179 user_182 user_186 user_187 user_195
## Aladdin (1992)                  NA       NA       NA        5       NA       NA
## American Beauty (1999)         5.0       NA      5.0       NA        4        4
## Apollo 13 (1995)                NA        4      2.5       NA       NA        4
## Back to the Future (1985)      4.5       NA      3.0       NA       NA        5
## Batman (1989)                   NA        3      3.5        4       NA       NA
## Braveheart (1995)              4.0        5      3.5       NA        3       NA
##                           user_198 user_199 user_200 user_201 user_202 user_212
## Aladdin (1992)                  NA       NA      4.0       NA        4       NA
## American Beauty (1999)           5        5      3.5        5        4      3.5
## Apollo 13 (1995)                NA        4      4.0        4        4       NA
## Back to the Future (1985)        5       NA      4.0        5        4       NA
## Batman (1989)                    3        3       NA        3        3       NA
## Braveheart (1995)                3       NA      4.5       NA        4       NA
##                           user_217 user_219 user_220 user_226 user_230 user_232
## Aladdin (1992)                  NA      4.5        5      4.0        2      3.0
## American Beauty (1999)          NA      5.0       NA      4.0       NA       NA
## Apollo 13 (1995)                NA      4.0        5      4.5        2      4.5
## Back to the Future (1985)        3      3.5        5      4.0       NA      3.0
## Batman (1989)                    2      3.5       NA       NA        3       NA
## Braveheart (1995)                2       NA       NA       NA       NA      4.5
##                           user_233 user_239 user_247 user_249 user_254 user_263
## Aladdin (1992)                  NA      4.0        5      4.0       NA       NA
## American Beauty (1999)           3      5.0        4      4.5      5.0        4
## Apollo 13 (1995)                 2       NA        3      2.5      4.0        4
## Back to the Future (1985)       NA       NA        4      4.5      3.5       NA
## Batman (1989)                   NA       NA       NA       NA      2.5       NA
## Braveheart (1995)                3      4.5        4      5.0      4.0        4
##                           user_266 user_274 user_275 user_279 user_282 user_288
## Aladdin (1992)                  NA      4.0       NA      2.0      4.5        4
## American Beauty (1999)          NA      5.0        4      3.5      4.5       NA
## Apollo 13 (1995)                NA       NA       NA       NA      4.5        3
## Back to the Future (1985)        4      3.5        4      3.5      5.0        5
## Batman (1989)                    4      3.0       NA       NA      3.5        3
## Braveheart (1995)                5      4.5       NA      4.0       NA        5
##                           user_292 user_298 user_304 user_305 user_307 user_308
## Aladdin (1992)                 4.0       NA        4       NA      4.0       NA
## American Beauty (1999)          NA      4.0        2      5.0      4.0       NA
## Apollo 13 (1995)                NA       NA        5       NA      2.0       NA
## Back to the Future (1985)      4.0      3.5        5      5.0      4.0       NA
## Batman (1989)                  3.5      3.5       NA      2.5      4.0       NA
## Braveheart (1995)              2.5      3.0        5       NA      3.5        1
##                           user_313 user_314 user_317 user_318 user_322 user_328
## Aladdin (1992)                  NA        3       NA       NA       NA      3.5
## American Beauty (1999)           4       NA        5      3.5      4.5       NA
## Apollo 13 (1995)                NA        4        3       NA      4.0      3.0
## Back to the Future (1985)        2       NA       NA      2.5       NA      4.0
## Batman (1989)                    5        3       NA       NA       NA      2.0
## Braveheart (1995)               NA        4        5       NA      3.5      1.0
##                           user_330 user_332 user_334 user_339 user_352 user_354
## Aladdin (1992)                 3.0       NA       NA       NA       NA      3.5
## American Beauty (1999)         4.5      4.5       NA      5.0        5      4.0
## Apollo 13 (1995)               3.0      3.5       NA      4.0       NA      4.0
## Back to the Future (1985)      4.0      4.0      3.5      4.0       NA      4.0
## Batman (1989)                  4.0       NA       NA      2.5       NA      4.0
## Braveheart (1995)              3.5      3.5       NA       NA       NA       NA
##                           user_357 user_362 user_368 user_370 user_372 user_376
## Aladdin (1992)                 4.5       NA       NA       NA        4       NA
## American Beauty (1999)         3.5       NA        4      3.5       NA       NA
## Apollo 13 (1995)               3.5       NA       NA       NA        3      5.0
## Back to the Future (1985)      4.0       NA       NA       NA        5      4.5
## Batman (1989)                  3.0      4.5        3      4.0        3       NA
## Braveheart (1995)              4.0      4.0        4       NA        4      3.5
##                           user_380 user_381 user_382 user_385 user_387 user_391
## Aladdin (1992)                   5      4.0        5        4      2.5       NA
## American Beauty (1999)          NA       NA       NA       NA      4.5        4
## Apollo 13 (1995)                NA      3.5        4        5       NA        4
## Back to the Future (1985)        5      4.0       NA        4      2.0        4
## Batman (1989)                    3       NA       NA        3      4.0        4
## Braveheart (1995)                4       NA       NA       NA      3.5        5
##                           user_399 user_414 user_415 user_425 user_428 user_432
## Aladdin (1992)                  NA        4      4.0      3.0      2.0       NA
## American Beauty (1999)         0.5        5      3.5      3.0      3.5      3.5
## Apollo 13 (1995)                NA        4      4.0      3.0      2.0       NA
## Back to the Future (1985)      5.0        5       NA       NA       NA       NA
## Batman (1989)                   NA        4       NA      3.5      3.0       NA
## Braveheart (1995)              3.0        5       NA      4.0      2.5      4.0
##                           user_434 user_438 user_448 user_452 user_453 user_462
## Aladdin (1992)                 4.0      4.0       NA       NA        5       NA
## American Beauty (1999)         5.0       NA        4        4        5      3.5
## Apollo 13 (1995)               5.0      4.0        3       NA       NA       NA
## Back to the Future (1985)      3.5      4.0        5        4       NA      1.5
## Batman (1989)                   NA      4.0        3        5       NA      3.0
## Braveheart (1995)              4.5      4.5       NA        5        5       NA
##                           user_464 user_469 user_470 user_474 user_477 user_480
## Aladdin (1992)                  NA        2        3      4.0      3.0      4.0
## American Beauty (1999)           4        5       NA      3.5      4.5      4.0
## Apollo 13 (1995)                NA       NA        3      4.5      4.0      3.5
## Back to the Future (1985)       NA        3       NA      4.5      4.5      5.0
## Batman (1989)                   NA        3        3      4.0       NA      4.5
## Braveheart (1995)                5        5        5      3.0       NA      5.0
##                           user_483 user_489 user_514 user_517 user_522 user_524
## Aladdin (1992)                 4.0      3.5      4.0      3.0        4        4
## American Beauty (1999)         4.0      4.0      4.0      1.0        5       NA
## Apollo 13 (1995)               2.0      3.5      4.0       NA       NA        5
## Back to the Future (1985)      4.5      3.5      5.0      5.0        5        5
## Batman (1989)                  3.5      4.0      2.5      3.0       NA        3
## Braveheart (1995)              4.0      4.5       NA      1.5        4        3
##                           user_525 user_534 user_551 user_555 user_559 user_560
## Aladdin (1992)                 3.5      4.5       NA       NA        4       NA
## American Beauty (1999)         4.0      3.5       NA        5       NA        4
## Apollo 13 (1995)               4.0       NA       NA        4        3        4
## Back to the Future (1985)      4.0      5.0      4.0        3       NA       NA
## Batman (1989)                   NA      4.0       NA        3        3       NA
## Braveheart (1995)               NA       NA      3.5        5        4        4
##                           user_561 user_562 user_570 user_573 user_577 user_580
## Aladdin (1992)                  NA        4       NA      4.5       NA      2.0
## American Beauty (1999)         3.5        5      4.0      2.0       NA      5.0
## Apollo 13 (1995)                NA        3      4.0      3.0       NA       NA
## Back to the Future (1985)      4.5       NA      4.0      4.5        5      3.5
## Batman (1989)                  4.5       NA       NA      4.5        2      3.0
## Braveheart (1995)              5.0        4      3.5      5.0        4      4.5
##                           user_586 user_590 user_593 user_594 user_596 user_597
## Aladdin (1992)                 4.5      4.0      3.5      4.5       NA        4
## American Beauty (1999)          NA      3.0      4.5       NA       NA        5
## Apollo 13 (1995)                NA      4.5      3.0      3.5      3.5       NA
## Back to the Future (1985)      4.5      4.5       NA       NA      4.0        5
## Batman (1989)                   NA      3.5       NA      4.5      3.5        4
## Braveheart (1995)              5.0      4.0      3.0      5.0       NA        5
##                           user_599 user_600 user_602 user_603 user_606 user_607
## Aladdin (1992)                 3.0      3.5       NA       NA       NA       NA
## American Beauty (1999)         5.0      4.5       NA        5      4.5        3
## Apollo 13 (1995)               2.5      2.0        4       NA       NA        5
## Back to the Future (1985)      3.5      4.5       NA        2      3.5        3
## Batman (1989)                  3.5      2.5        4        2      3.5        3
## Braveheart (1995)              3.5      2.0        5        1      3.5        5
##                           user_608 user_610
## Aladdin (1992)                   3       NA
## American Beauty (1999)           5      3.5
## Apollo 13 (1995)                 2       NA
## Back to the Future (1985)        2      5.0
## Batman (1989)                    3      4.5
## Braveheart (1995)                4      4.5
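
Since the printed table is hard to read at a glance, a few summaries of the missingness pattern can help with the IDA. The following is a minimal sketch on a small toy matrix with made-up values (the object `toy` and its ratings are assumptions for illustration only); the same calls should apply to `movielens`.

```r
# Toy ratings matrix (hypothetical values): rows are movies, columns are users
toy <- matrix(c( 5, NA,  3,  4,
                NA,  4,  4, NA,
                 2,  5, NA,  3),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("movie_a", "movie_b", "movie_c"),
                              paste0("user_", 1:4)))

# Overall proportion of missing ratings
prop_missing <- mean(is.na(toy))

# Number of ratings each movie received and each user gave
ratings_per_movie <- rowSums(!is.na(toy))
ratings_per_user  <- colSums(!is.na(toy))

# Mean rating per movie, ignoring missing values
mean_rating <- rowMeans(toy, na.rm = TRUE)

prop_missing
ratings_per_movie
mean_rating
```

Replacing `toy` with `movielens` gives the same summaries for the lab data.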

1.3 Hierarchical clustering

Given the large number of variables, a natural high-dimensional visualization method is to cluster the movies based on the different user ratings. We will look at how to do this in R.

  1. Basic hclust usage

Perform hierarchical clustering using the hclust() function and plot the resulting dendrogram. Try it with the average, complete and single methods.

d <- dist(movielens)
h <- hclust(d)
plot(h, cex = 0.4)

h_avg <- hclust(d, method = "average")
plot(h_avg, cex = 0.4)

h_single <- hclust(d, method = "single")
plot(h_single, cex = 0.4)
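
Note that dist() is being applied to data with missing values. As documented in ?dist, rows are compared using only the columns observed in both, and the sum is scaled up in proportion to the number of columns used. The sketch below checks this by hand for two made-up rating vectors (`x` and `y` are hypothetical values, not taken from the lab data).

```r
# Two toy rating vectors (hypothetical values); the second rating of x is missing
x <- c(1, NA, 3)
y <- c(2, 5, 5)

# Euclidean distance as computed by dist() (the default method)
d_xy <- dist(rbind(x, y))

# Manual calculation: use only the columns observed in both rows, then
# scale the sum of squares by (total columns) / (columns actually used)
used <- !is.na(x) & !is.na(y)
manual <- sqrt(sum((x[used] - y[used])^2) * length(x) / sum(used))

c(dist = as.numeric(d_xy), manual = manual)
```

Two movies rated by few common users therefore get a distance based on little information, which is worth keeping in mind when reading the dendrograms.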

  2. Form clusters in hclust

Use the cutree() function on the output of hclust() (with default settings) to separate the movie titles into four clusters. Can you extract the movies in cluster 1? We can also cut the tree by defining a height at which the tree should be cut. Can you find the value of h to cut the tree into four clusters?

movie_groups <- cutree(h, k = 4) 
which(movie_groups == 1)
##            Aladdin (1992) Back to the Future (1985)     Lion King, The (1994) 
##                         1                         4                        16 
##              Shrek (2001)          Toy Story (1995) 
##                        29                        37
summary(cutree(h, h = 16)) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.375   3.000   4.000
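
One way to find a suitable cut height without trial and error is to read it off the merge heights stored in the hclust object. The sketch below uses the built-in USArrests data as a stand-in (an assumption, so that it runs without the lab files); any height between the two merge heights it reports should give four clusters.

```r
# Complete linkage (the hclust default) on a built-in dataset
hc <- hclust(dist(USArrests))
n <- nrow(USArrests)
k <- 4

# After the (n - k)-th merge there are k clusters left, so any cut height
# at or above that merge and below the next one gives k clusters
heights <- sort(hc$height)
lower <- heights[n - k]
upper <- heights[n - k + 1]
h_mid <- (lower + upper) / 2

groups_h <- cutree(hc, h = h_mid)
groups_k <- cutree(hc, k = k)

c(lower = lower, upper = upper)
table(groups_h, groups_k)
```

Applying the same steps to the object h from the lab gives the range of heights that correspond to four movie clusters.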
  3. You may have noticed that not every movie has a rating by every user. This makes sense, since no one could possibly have watched every movie. One question you may ask is whether the clustering result is based on the actual value of the rating (of 1 to 5 stars), or whether it is driven by the existence of a rating. Make a new dataset by replacing all missing ratings (i.e. the NAs) with 0, and all the ratings (regardless of value) with 1. Then repeat the hierarchical clustering, but this time use the Manhattan distance. Use cutree to find 4 clusters and compare to your result in the previous question.
movielens_mat <- as.matrix(movielens)
movielens_mat[which(is.na(movielens_mat))] <- 0 
movielens_mat[which(movielens_mat > 0)] <- 1

d_man <- dist(movielens_mat, method = "manhattan") 
h_man <- hclust(d_man) 
plot(h_man, cex = 0.5)

movie_groups_man <- cutree(h_man, k = 4) 
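
To compare the two sets of cluster labels, a cross-tabulation is a simple starting point. The following is a sketch on simulated ratings (the matrix `toy`, its size and its missingness rate are all assumptions for illustration); with the lab objects the last line would be table(movie_groups, movie_groups_man).

```r
set.seed(1)

# Toy ratings (hypothetical): 12 movies by 30 users, 40% of entries set to NA
toy <- matrix(sample(1:5, 12 * 30, replace = TRUE), nrow = 12,
              dimnames = list(paste0("movie_", 1:12), NULL))
toy[sample(length(toy), 0.4 * length(toy))] <- NA

# Clustering on the rating values (Euclidean distance, NAs skipped by dist)
groups_value <- cutree(hclust(dist(toy)), k = 4)

# Clustering on the presence of a rating only (0/1 indicator, Manhattan distance)
rated <- (!is.na(toy)) * 1
groups_rated <- cutree(hclust(dist(rated, method = "manhattan")), k = 4)

# Rows: value-based clusters; columns: presence-based clusters
comparison <- table(value = groups_value, presence = groups_rated)
comparison
```

Cluster numbers are arbitrary labels, so agreement shows up as each row having most of its count in a single column, not as counts on the diagonal.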

1.4 Visualize the data [Optional]

R also offers a number of packages that enable the user to visualize the data together with the clustering tree. We call these visualizations “heatmaps” of the data matrix. Download and install the package ComplexHeatmap using the code provided below. The function Heatmap expects a matrix, so we first need to convert the input. The arguments row_names_gp and column_names_gp enable us to reduce the font size.

# BiocManager::install("ComplexHeatmap")
# BiocManager::install("shape")
library(ComplexHeatmap)
movielens_matrix <- as.matrix(movielens)  
Heatmap(movielens_matrix, 
        row_names_gp = gpar(fontsize = 7),
        column_names_gp = gpar(fontsize = 7))

1.5 Comparing trees [Optional]

Suppose we would like to compare two trees and visualize the differences. The R package dendextend compares two dendrograms. Its key functions are:

  • untangle(): finds an alignment of the two trees
  • tanglegram(): visualises the two dendrograms side by side
  • entanglement(): computes the quality of the alignment

library(dendextend)

# Create two dendrograms
h_avg <- hclust(d, method = "average")
h_single <- hclust(d, method = "single")
dend1 <- as.dendrogram (h_avg)
dend2 <- as.dendrogram (h_single)

# Create a list to hold dendrograms
dend_list <- dendlist(dend1, dend2)

# Compare the two trees
tanglegram(dend_list)
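
As a numerical complement to the tanglegram, two trees can also be compared through their cophenetic distances using base R only. This is a different technique from the dendextend functions above, shown here as a sketch on the built-in USArrests data (an assumption, so that it runs without the lab files).

```r
d <- dist(USArrests)
h_avg <- hclust(d, method = "average")
h_single <- hclust(d, method = "single")

# Cophenetic distance: the height at which two observations are first merged
coph_avg <- cophenetic(h_avg)
coph_single <- cophenetic(h_single)

# Correlation between the two trees: values near 1 mean similar structure
cor_trees <- cor(coph_avg, coph_single)

# Correlation of each tree with the original distances: how faithfully
# the tree represents the data it was built from
cor_avg_data <- cor(coph_avg, d)
cor_single_data <- cor(coph_single, d)

c(trees = cor_trees, average = cor_avg_data, single = cor_single_data)
```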

1.6 \(k\)-means

  1. Basic \(k\)-means usage

Next, let’s explore the kmeans method. Go back to the original movies dataset, with ratings from 0.5 to 5 and missing values, and make a new dataset that replaces all the NAs with 0 but keeps the ratings. We are doing this because the kmeans function cannot handle missing values. In a later module, we will look at how to handle missing values. Use kmeans to cluster the movies into four clusters. How many movies are in each cluster?

movielens_mat <- as.matrix(movielens)
movielens_mat[is.na(movielens_mat)] <- 0 
kmeans_res <- kmeans(movielens_mat, centers = 4)
table(kmeans_res$cluster) 
## 
##  1  2  3  4 
## 10  9 15  6
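
Because kmeans starts from randomly chosen centers, the cluster sizes above can change from run to run. A minimal sketch of two common safeguards, fixing the seed and using several random starts through the nstart argument, is shown below on the built-in USArrests data (the dataset and the value nstart = 25 are assumptions for illustration).

```r
x <- scale(USArrests)

# Same seed, same call: the result is reproducible
set.seed(5003)
fit_a <- kmeans(x, centers = 4, nstart = 25)
set.seed(5003)
fit_b <- kmeans(x, centers = 4, nstart = 25)

# nstart = 25 runs 25 random initialisations and keeps the one with the
# smallest total within-cluster sum of squares
same_labels <- identical(fit_a$cluster, fit_b$cluster)

same_labels
table(fit_a$cluster)
```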
  2. To visualize results from the kmeans clustering, use a dimension reduction technique such as PCA.
movie_pc = prcomp(movielens_mat, scale = TRUE)
library(ggplot2)  # provides ggplot(), which is used below
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
movie.df = data.frame(PC1 = movie_pc$x[,1], PC2 = movie_pc$x[,2], labels = factor(kmeans_res$cluster))
ggplot(movie.df, aes(PC1, PC2, col = labels)) + geom_point() + theme_minimal()

1.7 Cluster statistics

Let’s now look at the cluster statistics. Can you plot the total within group sum of squares for k = 2, 3, 4, 5, 6 from kmeans()? The value tot.withinss is part of the output of kmeans. Repeat for the between group sum of squares (betweenss). Do the plots hint at the best k?

set.seed(5003)
kmeans_2 <- kmeans(movielens_mat, centers = 2)
kmeans_3 <- kmeans(movielens_mat, centers = 3)
kmeans_4 <- kmeans(movielens_mat, centers = 4)
kmeans_5 <- kmeans(movielens_mat, centers = 5)
kmeans_6 <- kmeans(movielens_mat, centers = 6)

tot.withinss <- c(kmeans_2$tot.withinss, kmeans_3$tot.withinss,
                  kmeans_4$tot.withinss, kmeans_5$tot.withinss,
                  kmeans_6$tot.withinss)

betweenss <- c(kmeans_2$betweenss, kmeans_3$betweenss,
               kmeans_4$betweenss, kmeans_5$betweenss,
               kmeans_6$betweenss)

# or more directly, using the apply family over a larger range
set.seed(5003)
center.seq <- 2:39
# Store the fits under a name that does not mask the kmeans() function
kmeans_fits <- lapply(center.seq, function(x) kmeans(movielens_mat, centers = x))
tot.within.ss <- sapply(kmeans_fits, "[[", "tot.withinss")
between.ss <- sapply(kmeans_fits, "[[", "betweenss")


plot(center.seq, tot.within.ss, xlab = "Number of clusters", main = "Within group SS")

plot(center.seq, between.ss, xlab = "Number of clusters", main = "Between group SS")
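
When reading the two plots, it helps to remember that the within and between sums of squares always add up to the total sum of squares, which does not depend on k. The two curves are therefore mirror images, and the "elbow" appears at the same k in both. The sketch below checks the identity on the built-in USArrests data (the dataset, the range of k and nstart = 25 are assumptions for illustration).

```r
x <- scale(USArrests)

set.seed(5003)
ks <- 2:8
fits <- lapply(ks, function(k) kmeans(x, centers = k, nstart = 25))

wss <- sapply(fits, "[[", "tot.withinss")   # total within group SS
bss <- sapply(fits, "[[", "betweenss")      # between group SS
tss <- fits[[1]]$totss                      # total SS, the same for every k

# Proportion of the total SS explained by the clustering
elbow <- data.frame(k = ks, within = wss, between = bss,
                    prop_between = bss / tss)
elbow

plot(ks, wss, type = "b", xlab = "Number of clusters",
     ylab = "Total within group SS", main = "Elbow plot")
```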

2 Author by word count

The next dataset, author_count.csv, shows the counts of common words appearing in documents by four authors: Jane Austen, Jack London, William Shakespeare and John Milton. We would like to investigate whether clustering based on word characteristics is able to split the four authors apart. The first column shows the author and the remaining columns show the counts of each word.

2.1 Data input

author.dat <- read.csv("author_count.csv", header = TRUE)
numeric.dat <- author.dat[-1]
authors <- factor(author.dat[[1]])

2.2 PCA

Compute the PCA and visualize the output.

pca.scaled <- prcomp(numeric.dat, scale = TRUE)
library(ggplot2)  # provides ggplot(), which is used below
library(gridExtra)
author.df <- data.frame(PC1 = pca.scaled$x[,1], PC2 = pca.scaled$x[,2],
                        PC3 = pca.scaled$x[,3], labels = authors)
pca.plot <- ggplot(author.df, aes(PC1, PC2, col = labels)) + geom_point() + theme_minimal()
pca.plot
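
One PCA calibration worth exploring is whether or not to scale the variables. As a sketch, the code below compares the proportion of variance explained with and without scaling on the built-in USArrests data (used here as an assumption so that the example runs without the lab files); the same comparison can be made with numeric.dat.

```r
# PCA on the raw variables and on variables standardised to unit variance
pca_raw <- prcomp(USArrests, scale. = FALSE)
pca_std <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance explained by each principal component
var_raw <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
var_std <- pca_std$sdev^2 / sum(pca_std$sdev^2)

round(rbind(unscaled = var_raw, scaled = var_std), 3)
```

Without scaling, the first component is dominated by whichever variable has the largest variance. For word counts, that would tend to be the most frequent words.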

2.3 t-SNE

Compute and view the \(t\)-SNE plots for this dataset at several perplexity values, and consider how the choice of perplexity changes the picture.

Solution

library(Rtsne)
library(ggpubr)
set.seed(5003)
perplexity <- c(5, 10, 20, 50)
rtsne <- lapply(perplexity, function(x) {
  y <- Rtsne(numeric.dat, dims = 2, perplexity = x)$Y
  attr(y, "perplexity") <- x
  y
  })

tsne.plots <- lapply(rtsne, function(dat) {
  perplexity <- attr(dat, "perplexity")
  dat <- as.data.frame(dat)
  names(dat) <- c("x", "y")
  dat[["author"]] <- authors
  ggplot(dat) + geom_point(aes(x = x, y = y, colour = author)) +
    ggtitle(paste0("Perplexity = ", perplexity))
})

ggarrange(plotlist = tsne.plots, common.legend = TRUE)

2.4 MDS

  1. Consider the multidimensional scaling (MDS) technique to visualize the data. Compute different distance matrices using the dist function for the author_count dataset.

Solution

dist.types <- c("euclidean", "maximum", "manhattan", "canberra", "binary",  "minkowski")
dist.matrices <- lapply(dist.types, function(x) {
  y <- dist(numeric.dat, method = x)
  attr(y, "method") <- x
  y
})
  2. Create the MDS plot in 2 dimensions and colour the plot by the true author.

Solution

mds.out <- lapply(dist.matrices, function(x) {
  y <- cmdscale(x)
  attr(y, "method") <- attr(x, "method")
  y
})

mds.plots <- lapply(mds.out, function(dat) {
  method <- attr(dat, "method")
  dat <- as.data.frame(dat)
  names(dat) <- c("x", "y")
  dat[["author"]] <- authors
  ggplot(dat) + geom_point(aes(x = x, y = y, colour = author)) + ggtitle(method)
})

ggarrange(plotlist = mds.plots, common.legend = TRUE)
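
A useful reference point when comparing these plots: classical MDS (cmdscale) on Euclidean distances gives the same configuration as an unscaled PCA, up to a sign flip of each axis. The sketch below checks this on the built-in USArrests data (the dataset is an assumption for illustration), which is why the other distance types are the more informative ones to explore.

```r
x <- as.matrix(USArrests)

# Classical MDS on Euclidean distances, two dimensions
mds <- cmdscale(dist(x), k = 2)

# Scores on the first two principal components, centred but not scaled
pca <- prcomp(x, scale. = FALSE)$x[, 1:2]

# The coordinates agree up to the sign of each axis, so compare absolute values
max_diff <- max(abs(abs(mds) - abs(pca)))
max_diff

plot(mds, xlab = "MDS 1", ylab = "MDS 2", main = "Classical MDS, Euclidean")
```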

2.5 Compare and contrast

Select the best result in each case for PCA, \(t\)-SNE and MDS and compare.

Solution

mds.plot <- mds.plots[[which(dist.types == "canberra")]]
tsne.plot <- tsne.plots[[which(perplexity == 20)]]
ggarrange(plotlist = list(pca.plot, tsne.plot, mds.plot), common.legend = TRUE)

The Canberra metric appears to be the best for MDS. The \(t\)-SNE plot seems to discriminate best between the authors, giving well-separated clusters with tight grouping around their centers. However, there seem to be a few points that \(t\)-SNE has placed in the wrong cluster. The MDS plot does not seem to suffer from this, but its clusters are not as well separated. While PCA does an admirable job, it is the least favourable of the three for explaining the data visually.

3 Shiny app to allow the user to explore and decide

Create a Shiny app for the author_count data which lets the user choose which visualization technique to use and calibrate it with any necessary parameters.